# OpenAI Triton Plugin in TensorRT-LLM

This document describes how to build and run a custom plugin leveraging [OpenAI Triton](https://github.com/openai/triton) in TensorRT-LLM.
The workflow can be summarized as follows.
  1. Implement a kernel using Triton in Python.
  2. Compile that kernel using the Triton AoT (Ahead-of-Time) compilation tool to generate C files.
  3. Implement a custom TensorRT-LLM plugin to execute the compiled kernel.
  4. Build the TensorRT engine.
  5. Run the engine with TensorRT.

In this example, we show how to create a TensorRT-LLM plugin to wrap a [Fused Attention](fmha_triton.py) kernel implemented in OpenAI Triton.
As a prerequisite, it is necessary to have the TensorRT-LLM C++ runtime library.
The instructions to build that library can be found [here](../../README.md#build-from-source).

## 1. Triton AoT Preparation

OpenAI Triton offers an Ahead-of-Time (AoT) compilation tool to generate C files that wrap a compiled GPU kernel.
To use the AoT feature, you need a Triton version later than the [d0c35b3](https://github.com/openai/triton/commit/d0c35b3b7d6badf0c0d56a821dddab7ace73b4de) commit;
this example has been tested on the [b43c28f](https://github.com/openai/triton/tree/b43c28fdd7a2f95b2e87180cba5d984732120d5c) commit.
```bash
git clone https://github.com/openai/triton
cd triton/python/
git checkout d4644d6cb3ae674e1f15932cac1f28104795744f
pip install cmake && pip install .
cd -
```

For AoT compilation, it is necessary to provide a kernel signature and specify the values of `tl.constexpr` parameters in a comma-separated format.
Details can be found in the [compile.py](https://github.com/openai/triton/blob/main/python/triton/tools/compile.py) file in the Triton project.
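
The signature format can be illustrated with a small helper. This is a minimal sketch, not part of Triton or TensorRT-LLM: the function name `build_signature` and its parameters are hypothetical, and it assumes that the `:16` suffix on pointer types is a divisibility hint, which is consistent with the `% 16` checks in the generated dispatcher. It reproduces the string passed to `-s` for the float16 kernel in this section.

```python
def build_signature(pointer_dtypes, scalar_types, constexpr_values, divisibility=16):
    """Join pointer, scalar, and constexpr entries into a comma-separated string.

    Hypothetical helper: pointers become "*<dtype>:<divisibility>", scalars are
    kept as-is, and tl.constexpr parameters are written as plain values.
    """
    pointers = [f"*{dtype}:{divisibility}" for dtype in pointer_dtypes]
    constants = [str(value) for value in constexpr_values]
    return ", ".join(pointers + list(scalar_types) + constants)


fp16_signature = build_signature(
    pointer_dtypes=["fp16", "fp32", "fp32", "fp16", "fp16", "fp16"],
    scalar_types=["fp32", "i32", "i32", "i32"],
    constexpr_values=[128, 64, 128],  # BLOCK_M, BLOCK_DMODEL, BLOCK_N
)
print(fp16_signature)
```

Building the string programmatically can help keep the pointer types and block sizes in sync when several kernel variants are compiled.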

Here are examples of kernel AoT compilations for the [Fused Attention](fmha_triton.py) kernel.
```bash
# Kernel for data type=float16, BLOCK_M=128, BLOCK_DMODEL=64, BLOCK_N=128
export TRITON_ROOT=$(pip show triton | grep Location | cut -d' ' -f2)
rm -rf aot
mkdir -p aot/fp16
python ${TRITON_ROOT}/triton/tools/compile.py \
    fmha_triton.py \
    -n fused_attention_kernel \
    -o aot/fp16/fmha_kernel_d64_fp16 \
    --out-name fmha_d64_fp16 \
    -w 4 \
    -ns 2 \
    -s "*fp16:16, *fp32:16, *fp32:16, *fp16:16, *fp16:16, *fp16:16, fp32, i32, i32, i32, 128, 64, 128" \
    -g "(seq_len + 127) / 128, batch_size * num_heads, 1"
# Kernel for data type=float32, BLOCK_M=64, BLOCK_DMODEL=64, BLOCK_N=64
mkdir -p aot/fp32
python ${TRITON_ROOT}/triton/tools/compile.py \
    fmha_triton.py \
    -n fused_attention_kernel \
    -o aot/fp32/fmha_kernel_d64_fp32 \
    --out-name fmha_d64_fp32 \
    -w 4 \
    -ns 2 \
    -s "*fp32:16, *fp32:16, *fp32:16, *fp32:16, *fp32:16, *fp32:16, fp32, i32, i32, i32, 64, 64, 64" \
    -g "(seq_len + 63) / 64, batch_size * num_heads, 1"

# Link generated headers and create dispatchers.
python ${TRITON_ROOT}/triton/tools/link.py aot/fp16/*.h -o aot/fmha_kernel_fp16
python ${TRITON_ROOT}/triton/tools/link.py aot/fp32/*.h -o aot/fmha_kernel_fp32
```
The tool generates `.c` and `.h` files to launch the GPU kernel.
Note that it is necessary to specify the kernel name using the `--out-name` option, which defines the dispatcher names for the different data types.
The above invocations will generate `aot/fmha_kernel_{fp16|fp32}.{c|h}` files that contain three functions:
 - the `load_fmha_d64_{fp16|fp32}` function to load the code of the GPU kernel,
 - the `fmha_d64_{fp16|fp32}` function to launch the kernel,
 - the `unload_fmha_d64_{fp16|fp32}` function to unload the GPU kernel.
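
The `-g` option in the commands above describes the launch grid as three expressions, the first of which is a ceiling division written with integer arithmetic. The following sketch evaluates the same expressions in Python for the shapes used later in this document. The function name `launch_grid` and the parameter name `block` are hypothetical; `block` stands for the constant in the `-g` expression (128 for float16, 64 for float32).

```python
def launch_grid(seq_len, batch_size, num_heads, block):
    """Evaluate "(seq_len + block - 1) / block, batch_size * num_heads, 1".

    Python's // matches C integer division here because all values are
    non-negative.
    """
    grid_x = (seq_len + block - 1) // block  # ceiling of seq_len / block
    grid_y = batch_size * num_heads
    return (grid_x, grid_y, 1)


print(launch_grid(seq_len=512, batch_size=8, num_heads=32, block=128))  # float16 kernel
print(launch_grid(seq_len=512, batch_size=8, num_heads=32, block=64))   # float32 kernel
```

The rounding up matters when `seq_len` is not a multiple of the block size: a sequence of 500 tokens still needs 4 blocks of 128.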

If GPU resources are limited, it is recommended to adjust the number of stages or warps accordingly. For example, on the V100, the aforementioned arguments might fail due to insufficient shared memory of the GPU. This can be mitigated by reducing the number of stages by one, using `-ns 1`.


## 2. Implement a Custom TensorRT Plugin

This section describes how to implement a custom plugin for TensorRT-LLM to execute the Triton kernel created in the previous section.
We provide an example plugin implementation:
  - TritonFlashAttentionPlugin([.cpp](TritonFlashAttentionPlugin.cpp), [.h](TritonFlashAttentionPlugin.h)): TensorRT plugin.
  - [plugin.py](plugin.py): Python wrapper.

`TritonFlashAttentionPlugin` is a TensorRT plugin that integrates a Triton kernel generated with the AoT compiler.
The `initialize` and `terminate` functions show how to initialize and terminate the TensorRT plugin.
The `enqueue` member function shows how to call the generated Triton kernel on the GPU.
Note that the name of the Triton kernel depends on the function's signature, so different types or specializations lead to a different kernel name.
Thus, if you change an option during AoT compilation, such as `-s <signature>`, you also have to update the file names in `CMakeLists.txt` to match the names generated by the AoT compiler.
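
One way to see which names the AoT compiler actually produced is to list the generated files. The snippet below is a minimal sketch that assumes the `aot/` output directory from the previous section; the helper name `generated_sources` is hypothetical and is not part of this example's build scripts.

```python
from pathlib import Path


def generated_sources(aot_dir, suffixes=(".c", ".h")):
    """Return the sorted relative paths of generated C sources and headers."""
    root = Path(aot_dir)
    if not root.is_dir():
        return []  # nothing has been generated yet
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix in suffixes
    )


# Print the names to compare against the entries in CMakeLists.txt.
for name in generated_sources("aot"):
    print(name)
```

Comparing this listing with `CMakeLists.txt` after every change to the compilation options helps catch stale file names before the build fails.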

To build a shared library for the custom Triton plugin, run:
```bash
mkdir -p build && cd build
cmake .. && make
cd ..
```
As mentioned in the previous section, it is necessary to have the TensorRT-LLM C++ runtime library.
If you want to specify the library paths, run:
```bash
cmake -DTRT_LIB_DIR=</path/to/trt_lib> -DTRT_INCLUDE_DIR=</path/to/trt_headers> -DTRT_LLM_LIB_DIR=</path/to/trt_llm_lib> ..
```
If the build is successful, you should be able to find a shared library for the custom plugin at `build/libtrt_llm_custom_plugins.so`.

A Python wrapper of the Fused Multihead Attention (FMHA) operator and the corresponding TensorRT-LLM layer are implemented in [plugin.py](plugin.py).
It is similar to other TensorRT-LLM operators and layers implemented in [functional.py](../../tensorrt_llm/functional.py) and [layers](../../tensorrt_llm/layers), respectively.
That FMHA operator uses the custom plugin that wraps the functions generated from the Triton kernel.

## 3. Build and Run the TensorRT Engine

We are now ready to build and run the TensorRT engine that uses the Triton kernel.
Here are the two commands to build and run the engine:
```bash
python build.py --num_heads 32 --head_size 64 --max_batch_size 8 --max_seq_len 512 --dtype float16
python run.py --num_heads 32 --head_size 64 --batch_size 8 --seq_len 512 --log_level verbose --benchmark
```

## 4. Known Issues

### 1. A generated dispatcher might return without executing a kernel or raising an error due to a missing branch.

The kernel dispatcher written by `link.py` has a missing branch, which can result in returning without executing a kernel.
For instance, in our example, the generated dispatcher looks like this:
```c++
CUresult fmha_d64_fp16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len){
  if ((Out % 16 == 0) && (L % 16 == 0) && (M % 16 == 0) && (Q % 16 == 0) && (K % 16 == 0) && (V % 16 == 0))
    return fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(stream, gX, gY, gZ, Out, L, M, Q, K, V, sm_scale, seq_len);
}
```
It is recommended to manually update the functions generated by `link.py` so that they return an error code when no branch matches, allowing the caller to handle the failure.
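
As an alternative to editing by hand, the fix can be scripted. The following is a rough sketch under several assumptions: the helper `add_fallback_return` is hypothetical, it operates on the text of a single dispatcher function (not a whole generated file), and it uses `CUDA_ERROR_INVALID_VALUE` from the CUDA driver API as the fallback error, which is one reasonable choice rather than a requirement. The example input is a shortened version of the dispatcher shown above.

```python
FALLBACK = "  return CUDA_ERROR_INVALID_VALUE;\n"


def add_fallback_return(dispatcher_src):
    """Insert a fallback return before the closing brace of one C function."""
    closing = dispatcher_src.rstrip().rfind("}")
    if closing == -1:
        raise ValueError("no closing brace found")
    head = dispatcher_src[:closing]
    if not head.endswith("\n"):
        head += "\n"
    return head + FALLBACK + dispatcher_src[closing:]


# Shortened dispatcher, for illustration only.
example = (
    "CUresult fmha_d64_fp16(CUstream stream, CUdeviceptr Out){\n"
    "  if ((Out % 16 == 0))\n"
    "    return fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(stream, Out);\n"
    "}\n"
)
patched = add_fallback_return(example)
print(patched)
```

With the fallback in place, the plugin can check the returned `CUresult` and report a misaligned pointer instead of silently skipping the kernel launch.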


### 2. The shared memory required by a generated kernel may exceed the hardware limitation.

The AoT compiler does not check the shared memory size against hardware limits at compile time, which can lead to out-of-resource errors at runtime.
It is worth checking whether the dynamic shared memory required by a generated kernel exceeds the hardware limit.
You can find the value in the `cuLaunchKernel` call in the generated `.c` file.
For instance, the shared memory size is 114690 bytes in our example.
```c++
CUresult fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len) {
    if (fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67_func == NULL)
       load_fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67();
    void *args[8] = { &Out, &L, &M, &Q, &K, &V, &sm_scale, &seq_len };
    // TODO: shared memory
    if(gX * gY * gZ > 0)
      return cuLaunchKernel(fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67_func, gX, gY, gZ, 4 * 32, 1, 1, 114690, stream, args, NULL);
}
```
It may be resolved by reducing the block size.
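
The check can be automated with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, the parser assumes a single `cuLaunchKernel` call whose arguments contain no nested commas and whose shared memory size (the eighth argument) is an integer literal, and the 96 KiB limit is the maximum shared memory per block documented for compute capability 7.0 (V100). Check the limit for your own device.

```python
import re

V100_LIMIT_BYTES = 96 * 1024  # assumed limit for compute capability 7.0


def dynamic_shared_memory_bytes(c_source):
    """Extract the sharedMemBytes argument of the first cuLaunchKernel call."""
    match = re.search(r"cuLaunchKernel\((.*?)\);", c_source, re.DOTALL)
    if match is None:
        raise ValueError("no cuLaunchKernel call found")
    arguments = [arg.strip() for arg in match.group(1).split(",")]
    return int(arguments[7])  # f, grid x/y/z, block x/y/z, sharedMemBytes, ...


def exceeds_limit(required_bytes, limit_bytes):
    """Return True if the kernel needs more shared memory than the device allows."""
    return required_bytes > limit_bytes


launch_line = (
    "return cuLaunchKernel(fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67_func, "
    "gX, gY, gZ, 4 * 32, 1, 1, 114690, stream, args, NULL);"
)
required = dynamic_shared_memory_bytes(launch_line)
print(required, exceeds_limit(required, V100_LIMIT_BYTES))
```

In this example the kernel requests 114690 bytes, which is above the assumed V100 limit and matches the failure described in Section 1.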


### 3. AttributeError: module 'triton' has no attribute 'jit'

This problem may arise if Triton is installed in editable mode. To resolve this issue, install Triton in non-editable mode. Refer to https://github.com/openai/triton/issues/1693.
